{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Homework 2\n", "\n", "**Assigned**: March 5, 2024. **Due**: March 12, 2024 at 2:00pm Eastern. **Note**: Submissions received after 2:00pm Eastern on March 19, 2024 will receive no credit.\n", "\n", "**Submitting**: Upload your submission on Gradescope as a `.pdf`. Converting to a PDF can be a complicated process, and so we encourage you to test this process well in advance of the submission deadlines. We recommend converting to HTML, opening the HTML file in a browser, and then printing or exporting to a PDF from your browser. We do not recommend directly converting to a PDF, since this requires installing xelatex. To convert to HTML in VSCode, press `ctrl+shift+p` and type `export`, and you should see an option to export to HTML.\n", "\n", "**Note**: Keep your `.ipynb` file, as we may request it directly (via email).\n", "\n", "**Note**: When converting to a PDF file, ensure that all of your code cells have been executed. The results of these executions *must* be included in your submitted PDF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instructions\n", "\n", "Complete the questions below, replacing the blue text with your own answers (your answers do not need to remain in blue). Do **not** modify the green text. Try to answer the questions without consulting your notes or any online material. If you cannot, then consult your notes, and if absolutely necessary, consult course materials (slides, notebooks) and/or Wikipedia. Do **not** use other sources or tools like ChatGPT. Complete this part of the assignment on your own (do **not** work with others).\n", "\n", "After you have completed all of the questions, at the bottom of this assignment you will find a link to another notebook, `Homework 2 Solutions.ipynb`. This contains the solutions, and instructions for ensuring that your answers are correct and sufficient. 
Make another pass through your homework assignment, replacing the green text with descriptions of what you missed for each question, and providing the fixes necessary to make your answer correct. **The solutions file may include additional instructions, which may include additional content to respond to even if you got a question correct (e.g., additional reflection).** During this second stage where you are filling in your answers, replacing the green text, you may reference the solutions, work with others, and use any tools (including ChatGPT).\n", "\n", "You will only submit this assignment once after replacing both the blue and green text. You do not need to submit the assignment between the first and second passes. Grading for each question will be based on whether you followed this process, arrived at the correct answers, and included sufficient discussion/text in the end. Points will be deducted if you did not make a reasonable effort to answer the question initially, if your final answer remains incorrect, or if your answers were not sufficiently clear (so, write in full sentences with proper punctuation, and convey your arguments clearly). Other than verifying that you made a reasonable initial effort for your initial answers (blue), points will **not** be deducted due to *initial* answers being incorrect. Hence, there is no reason to break the rules to obtain correct answers initially." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Short Answer\n", "\n", "Answer the following questions with at least a few sentences, and no more than roughly one page of text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. 
[10 points] What is the difference between a parametric and a non-parametric method in machine learning?\n", "\n", "***Initial Answer***\n", "\n", "Replace this text with your answer.\n", "\n", "---\n", "\n", "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. [20 points] State whether the following parametric model is a linear parametric model. Explain your answer.\n", "\n", "Let $g_v$ be a linear parametric model with weights $v \\in \\mathbb R^m$ and basis $\\phi$. We will create a new parametric model, $f_w$, with weights $w \\in \\mathbb R^{2m}$. Let $w=[v,v']$, i.e., let $w$ be the concatenation of two different weight vectors $v \\in \\mathbb R^m$ and $v' \\in \\mathbb R^m$. Then, let $f_w$ be defined as:\n", "$$\n", "f_w(x_i) = \\max\\Big \\{ g_{v}(x_i), g_{v'}(x_i)\\Big \\}.\n", "$$\n", "\n", "Is this parametric model, $f_w$, a linear parametric model? Why or why not?\n", "\n", "**Note**: You may use any online resources (e.g., ChatGPT) to help you format equations using LaTeX.\n", "\n", "***Initial Answer***\n", "\n", "Replace this text with your answer.\n", "\n", "---\n", "\n", "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. [10 points] How can you estimate the mean squared error (MSE) of a particular parametric ML model $f_w$ given access to data, $D=(X_i,Y_i)_{i=1}^n$?\n", "\n", "**Note**: Here the weights $w$ have already been determined independently of the data $D$. 
You may write $f_w(x)$ to denote the prediction made by the model when given input $x$.\n", "\n", "***Initial Answer***\n", "\n", "Replace this text with your answer.\n", "\n", "---\n", "\n", "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. [10 points] How can you estimate the mean squared error (MSE) that would result if you used a specific ML algorithm `alg` to train a model using $m$ data points? \n", "\n", "You may assume access to $n>m$ data points. Your answer may be short - specifying a method or approach that we have discussed and why it is appropriate.\n", "\n", "***Initial Answer***\n", "\n", "Replace this text with your answer.\n", "\n", "---\n", "\n", "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. [10 points] Imagine that you have a current model for a regression problem with mean squared error (MSE) $2.0$. If we have $n=150$ data points, we train a model using $100$ and compute the sample mean squared error (MSE) on the other $50$ points, and we obtain a sample MSE of $1.2$, can we conclude that the new model is better (has lower MSE) than the current model? Why or why not?\n", "\n", "***Initial Answer***\n", "\n", "Replace this text with your answer.\n", "\n", "---\n", "\n", "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6. [10 points] Derive the gradient descent update equations for the following loss function.\n", "\n", "Let $L(w,D)=\\sum_{i=1}^n 1 + \\sin(2 \\pi (y_i - \\hat y_i - 1/4))$, where $y_i$ is the label associated with the $i^\\text{th}$ point and $\\hat y_i$ is the model's prediction of $y_i$. 
The code snippet below this question creates a plot of this loss function. Notice that this loss function indicates that the loss associated with an error scales with its distance from the closest integer. Errors equal to integer values result in no loss at all, while errors halfway between integers result in the largest loss.\n", "\n", "Consider a linear parametric model:\n", "$$\n", "f_w(x_i) = \\frac{1}{n}\\sum_{j=1}^m w_j \\phi_j(x_i),\n", "$$\n", "where $\\phi(x_i) \\in \\mathbb R^m$. Derive the gradient update rule for $w_j$. Your final answer should not include any derivative or gradient symbols (work out what these derivatives/gradients are).\n", "\n", "**Note:** Remember that $\\frac{\\partial}{\\partial x} \\sin(f(x)) = \\cos(f(x)) \\frac{\\partial}{\\partial x} f(x)$.\n", "\n", "**Note:** This problem is worth 2-4 times the number of points of previous problems, and should take a correspondingly longer amount of time to complete. I recommend working this out first with pencil and paper, double checking your derivatives, and then typing your answer.\n", "\n", "***Initial Answer***\n", "\n", "Replace this text with your answer.\n", "\n", "---\n", "\n", "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Define the function\n", "def f(x):\n", " return 1 + np.sin(2 * np.pi * (x - 0.25))\n", "\n", "x = np.linspace(-2, 2, 400)\n", "y = f(x)\n", "plt.plot(x, y)\n", "plt.xlabel('y_i - \\hat y_i')\n", "plt.grid(True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Programming\n", "\n", "The code below defines `evaluate_features`, which performs $k$-fold cross-validation to estimate the MSE that results when a linear parametric model is trained on the provided data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import KFold\n", "from sklearn.metrics import make_scorer, mean_squared_error\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import Pipeline\n", "\n", "# Function to perform k-fold cross-validation for regression\n", "def evaluate_features(X, y, n_splits=50, random_state=42):\n", " \"\"\"\n", " Evaluate features using k-fold cross-validation for regression.\n", "\n", " Parameters:\n", " X (DataFrame): The feature set.\n", " y (Series): The labels (continuous values).\n", " n_splits (int): The number of folds for cross-validation.\n", " random_state (int): The seed used by the random number generator.\n", "\n", " Returns:\n", " float: The average MSE across all folds.\n", " \"\"\"\n", " # Create a pipeline with a standard scaler and a linear regression model\n", " pipeline = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('model', LinearRegression())\n", " ])\n", "\n", " # Define the k-fold cross-validation method\n", " kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)\n", "\n", " # Perform cross-validation and 
calculate MSE\n", " scores_mse = cross_val_score(pipeline, X, y, cv=kf, scoring=make_scorer(mean_squared_error))\n", "\n", " return scores_mse.mean()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below loads the GPA data and uses `evaluate_features` to determine the MSE likely to result from training a linear model on this data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average MSE: 0.5820102673423404\n" ] } ], "source": [ "# Load the data set\n", "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas\n", "\n", "# Split the data into features and labels\n", "X = df.iloc[:, :-1]\n", "y = df.iloc[:, -1]\n", "\n", "# Evaluate the features\n", "average_mse = evaluate_features(X, y)\n", "print(f\"Average MSE: {average_mse}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the use of a fixed random number seed, each time you run this you should obtain the same result:\n", "> Average MSE: 0.5820105070662254\n", "\n", "#### 1. [20 points] In the code block below, perform **feature engineering** to come up with additional features that improve the predictions, reducing the reported average MSE to at most $0.573$.\n", "\n", "**Note**: This may require experimenting with many different additional features that can be added! Try out many different features to see how low you can get the average MSE.\n", "\n", "**Note**: There exist features that achieve an average MSE *less than* $0.57$."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Initial Answer***\n", "\n", "Delete this line and enter your initial answer in the code cell below.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average MSE: 0.582160173827006\n" ] } ], "source": [ "# NOTE: We recommend leaving these initial lines, so that if you re-run this cell many times you don't append multiple new features to the data set unintentionally\n", "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas\n", "X = df.iloc[:, :-1]\n", "y = df.iloc[:, -1]\n", "\n", "####################################################################################\n", "# ENTER YOUR INITIAL ANSWER HERE\n", "# Replace the line below, which appends a feature that does not improve the fit.\n", "X['avg_stem_score'] = X[['physics', 'math', 'chemistry']].mean(axis=1)\n", "####################################################################################\n", "\n", "# Evaluate the features\n", "average_mse = evaluate_features(X, y)\n", "print(f\"Average MSE: {average_mse}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. [10 points] Reflect on this feature engineering process.\n", "\n", "Write a paragraph reflecting on the feature engineering process. What kinds of features did you try first? Did you change approaches or change the types of features you added? Did you try including features that were linear combinations of existing features (that is, weighted summations of existing features, like averages of a subset of the features)? 
Did you then make a deliberate effort to create new features that are not linear combinations of existing features? Why do you think the best features you found are effective?\n", "\n", "***Initial Answer***\n", "\n", "Replace this text with your answer.\n", "\n", "---\n", "\n", "***Updated Answer***\n", "\n", "Replace this text with your response to the solution document.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The solutions can be found here: [https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/Homework%202%20Solutions.ipynb](https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/Homework%202%20Solutions.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }